AIbase

# Cross-modal reasoning

Gemma 3n E4B It
Gemma 3n is a lightweight, state-of-the-art open multimodal model family from Google. It is built on the same research and technology as the Gemini models and supports text, audio, and visual inputs.
Image-to-Text Transformers
google
1,690
81
Qwen2 VL 2B Instruct
Apache-2.0
Qwen2-VL-2B-Instruct is a multimodal vision-language model that supports image-text-to-text tasks.
Image-to-Text Transformers English
FriendliAI
24
1
Aya Vision 32b
Aya Vision 32B is an open-weight, 32B-parameter multimodal model developed by Cohere Labs that supports vision-language tasks in 23 languages.
Image-to-Text Transformers Supports Multiple Languages
CohereLabs
387
193
Eilev Blip2 Opt 2.7b
MIT
A vision-language model optimized for first-person perspective, trained on BLIP-2-OPT-2.7B using the EILEV method to elicit in-context learning capabilities.
Image-to-Text Transformers English
kpyu
214
4
Layoutlmv3 Base Mpdocvqa
A document visual question answering model based on Microsoft's pre-trained LayoutLMv3 and fine-tuned on the Multi-page Document VQA (MP-DocVQA) dataset.
Text-to-Image Transformers English
rubentito
664
9
© 2025 AIbase